Building and Installing a Hadoop/MapReduce Cluster from Commodity Components
Authors

Jochen L. Leidner [1] and Gary Berosik [2]

[1] Jochen L. Leidner, Ph.D. is a Research Scientist in the corporate Research and Development group at Thomson Reuters and a Director at Linguit Ltd. He holds a doctorate degree in Informatics from the University of Edinburgh, where he was a Royal Society Enterprise Fellow in Electronic Markets and a postdoctoral Research Fellow, and two Master's degrees, in Computational Linguistics and English Language and Literature, and in Computer Speech, respectively. His research interests include natural language processing, search engines, statistical data mining and software engineering. Jochen is a member of ACM, ACL and SIGIR, and has co-authored over twenty peer-reviewed papers and several patent applications.

[2] Gary Berosik is a Lead Software Engineer at Thomson Reuters Research and Development and an Adjunct Faculty member in the Graduate Programs in Software at the University of St. Thomas in St. Paul, MN. His interests include software engineering, parallel/grid/cloud processing, statistical machine learning algorithms, learning-support technologies, agent-based architectures and technologies supporting business intelligence and information analytics.
Abstract
This tutorial presents a recipe for the construction of a compute cluster for processing large volumes of data, using cheap, easily available personal computer hardware (Intel/AMD based PCs) and freely available open source software (Ubuntu Linux, Apache Hadoop).

Introduction

This article describes a straightforward way to build, install and operate a compute cluster from commodity hardware. A compute cluster is a utility that allows you to perform larger-scale computations faster than with individual PCs. We use commodity components (called "nodes") to keep the price down and to ensure easy availability of initial setup and replacement parts, and we use Apache Hadoop as middleware for distributed data storage and parallel computing.

Background

At the time of writing, single desktop computers and even mobile devices have become faster than the supercomputers of the past. At the same time, storage capacities of disk drives have been increasing by multiple orders of magnitude. As a result of mass production, prices have decreased and the number of users of such commodity machines has increased. At the same time, pervasive networking has become available and has led to the distribution and sharing of data, leading to distributed communication, creation, consumption, and collaboration. Perhaps paradoxically, the ever-increasing amount of digital content that is the result of more powerful machine storage and networking is leading to ever-increasing processing demands to find information and make sense of activities, preferences, and trends.

The analysis of large networks such as the World Wide Web (WWW) is such a daunting task that it can only be carried out on a network of machines. In the 1990s, Larry Page, Sergey Brin and others at Stanford University used a large number of commodity machines in a research project that attempted to crawl a copy of the entire WWW and analyze its content and hyperlink graph structure. The Web quickly grew, becoming too large for human-edited directories (e.g. Yahoo) to efficiently and effectively point people at the information they were looking for. In response, Digital Equipment Corporation (DEC) proposed the creation of a keyword index of all Web pages, motivated by their desire to show the power of their 64-bit Alpha processor. This effort became known as the AltaVista search engine.
Later, the aforementioned Stanford group developed a more sophisticated search engine named BackRub, later renamed Google. Today, Google is a search and advertising company, but it is able to deliver its innovative services only due to massive investments in the large-scale distributed storage and processing capability it developed in-house. This capability is provided by a large number of commodity off-the-shelf (COTS) PCs, the Google File System (GFS), a redundant cluster file system, and MapReduce, parallel data processing middleware. More recently, the Apache Hadoop project has developed a reimplementation of parts of GFS and MapReduce, and many groups have subsequently embraced this technology, permitting them to do things that they could not do on single machines.

Procurement

We choose the GNU/Linux operating system because it is very efficient, scalable, stable and secure, is available in source code form without licensing impediments, and has a large user base, which ensures rapid responses to support questions. We select the Ubuntu distribution of the operating system because it has good support, a convenient package management system, and is offered in a server edition that contains only server essentials (Ubuntu Server). At the time of writing, release 9.04 was current; little, if anything, of the described approach depends on this particular version.
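To give a concrete impression of the MapReduce programming model that Hadoop provides, the following is a minimal sketch of a word-count job written against Hadoop's Java API (the org.apache.hadoop.mapreduce classes). This example is added here purely for illustration and is not code from the original tutorial; class names such as WordCount, TokenizerMapper and IntSumReducer are arbitrary choices, and the API details assume a Hadoop release from roughly the 0.20 era that would have been current alongside Ubuntu 9.04.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative word-count job: counts how often each word occurs in the input files.
public class WordCount {

    // Map step: emit (word, 1) for every token in an input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce step: sum the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Job setup and submission; args[0] is the HDFS input path, args[1] the output path.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count"); // later Hadoop versions prefer Job.getInstance(conf)
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation to cut network traffic
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

A job like this is typically packaged into a jar and submitted with the hadoop command-line tool (for example, hadoop jar wordcount.jar WordCount <input> <output>); the input and output paths refer to directories in the cluster's distributed file system, so the same program scales from a single node to the whole cluster without modification.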
Journal: CoRR
Volume: abs/0911.5438
Year of publication: 2009